Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Fix DataFrame.to_xarray doctests and allow the CI to run it. #22673

Merged
merged 5 commits into from
Sep 19, 2018

Conversation

Moisan
Copy link
Contributor

@Moisan Moisan commented Sep 12, 2018

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Fix the docstring for DataFrame.to_xarray as described in #22459 . I removed the pandas.panel example since it was deprecated and I replaced it with a MultiIndex.

@pep8speaks
Copy link

pep8speaks commented Sep 12, 2018

Hello @Moisan! Thanks for updating the PR.

Comment last updated on September 14, 2018 at 20:39 Hours UTC

@gfyoung gfyoung added Docs DataFrame DataFrame data structure labels Sep 12, 2018
Return the xarray equivalent of the pandas object. `xarray
<http://xarray.pydata.org/en/stable/>`__ is a
Python package that allows to handle N-dimensional data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need to mention this here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to extend the previous short summary. Do you suggest to simply remove it or is there something else we should mention here?

Copy link
Member

@gfyoung gfyoung Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove it. We describe xarray elsewhere in the docs anyhow. You don't always have to put an extension of the short summary.

@codecov
Copy link

codecov bot commented Sep 12, 2018

Codecov Report

Merging #22673 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22673   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50708    50708           
=======================================
  Hits        46740    46740           
  Misses       3968     3968
Flag Coverage Δ
#multiple 90.58% <ø> (ø) ⬆️
#single 42.35% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 96.67% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73dd6ec...a7ecbb2. Read the comment docs.

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* items (items) object 'A' 'B' 'C' 'D'
* major_axis (major_axis) datetime64[ns] 2013-01-01 2013-01-02 2013-01-03 # noqa
* minor_axis (minor_axis) object 'first' 'second'
* first (first) object 'bar' 'baz' 'foo' 'qux'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be better to have a datetime index for 1 level

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @Moisan

I added some comments about the original docstring, that if you implement them, I think the examples won't only pass the tests, but will be much clearer.

@@ -2498,11 +2498,15 @@ def to_xarray(self):
a Dataset for a DataFrame
a DataArray for higher dims
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind replacing this to the standard format? Only the type in the first line, and a description in the next. For example:

xarray.DataArray or xarray.Dataset
    Data in the pandas structure converted to Dataset if the object is a DataFrame, or a DataArray if the object is a Series.

@@ -2498,11 +2498,15 @@ def to_xarray(self):
a Dataset for a DataFrame
a DataArray for higher dims

See also
--------
DataFrame.to_csv : Write out to a csv file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Also is with a capital A.

Personally I don't like "recommending" to_csv here, as besides not being a great format, it does not support multidimensional data. I think to_parquet and to_hdf seem more appropriate to me.

).set_index(['B','A'])
... 'B' : ['foo', 'bar', 'foo'],
... 'C' : np.arange(4.,7)}
... ).set_index(['B','A'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the need to repeat the previous DataFrame, we can just have df_multiindex = df.set_index(['B', 'A'])

'B' : ['foo', 'bar', 'foo'],
'C' : np.arange(4.,7)})
... 'B' : ['foo', 'bar', 'foo'],
... 'C' : np.arange(4.,7)})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using arange makes the code more complicated for no reason. Using [4., 5., 6.] is simpler and clearer I think.

Also, it'd be good to have a meaningful example. The code in the example is difficult to follow as there is no way to know that the object column is B, more than looking at the example. If we get something few samples with animals with name, num_legs, speed, it's obvious which is the float, which the int, and which the str/object.

@datapythonista datapythonista added the IO Data IO issues that don't fit into a more specific label label Sep 12, 2018
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really great. Just a couple of comments that if I'm right, should simplify the examples.

... ('monkey', 'mammal', np.nan, 4)],
... columns=['name', 'class', 'max_speed',
... 'num_legs'],
... index=[0, 2, 3, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we don't use the default index (so we don't specify it), or we specify one sorted? May be I'm missing the point, but seems like this should have a meaning, but couldn't see with the rest of the example. If there is no reason (may be you just copied from an example where this was for something?), I'd just remove it, so we save some space and avoid distractions.

The indentation of the num_legs seems wrong, I think it should be indented to the level of name. When possible we'll start validating automatically PEP8 in the examples, so if we can get this fixed already, that would be great.

... names=['first', 'second'])

>>> s = pd.Series(np.arange(8), index=index)
>>> s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this too complicated for what we need to show. To have a Series with a multiindex with a datetime level, we can have something like:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-02']),
                   'animal': ['falcon', 'parrot', 'falcon', 'parrot'],
                   'speed': [350, 18, 361, 15]}).set_index(['date', 'animal'])
df['speed']

I haven't used much xarray myself, and not sure what makes sense to show here. May be:

  • Series.to_xarray()
  • DataFrame.to_xarray()
  • DataFrame(with multiindex including datetime).to_xarray()

If that makes sense, I think with the first example, we can have df.to_xarray() and df['max_speed'].to_xarray(), and then a example like the one I wrote.

@jreback does this make sense?

Sorry for requesting the changes @Moisan, but my I find like the current version gives the idea that we're trying to show something more complex than what we are actually showing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I'm happy to make the examples more relevant :).

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks @Moisan

@jreback I think you comments should be addressed properly, let us know if that's not the case

@TomAugspurger TomAugspurger merged commit d923385 into pandas-dev:master Sep 19, 2018
@TomAugspurger
Copy link
Contributor

Thanks!

@Moisan Moisan deleted the docstring_to_xarray branch September 19, 2018 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFrame DataFrame data structure Docs IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants